Skip to content

[Feature #21943] Add StringScanner#integer_at#193

Open
jinroq wants to merge 41 commits intoruby:masterfrom
jinroq:add_integer_at
Open

[Feature #21943] Add StringScanner#integer_at#193
jinroq wants to merge 41 commits intoruby:masterfrom
jinroq:add_integer_at

Conversation

@jinroq
Copy link
Copy Markdown

@jinroq jinroq commented Mar 6, 2026

The specification for MatchData#integer_at has been defined here. StringScanner#integer_at follows this specification.

see: https://bugs.ruby-lang.org/issues/21932#note-6, https://bugs.ruby-lang.org/issues/21932#note-7
see: https://bugs.ruby-lang.org/issues/21943

return new_ary;
}

#ifdef HAVE_RB_INT_PARSE_CSTR
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on:
- push
- pull_request
- workflow_dispatch
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- workflow_dispatch

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9e0d504 fixed it.

have_func("onig_region_memsize(NULL)")
have_func("rb_reg_onig_match", "ruby/re.h")
have_func("rb_deprecate_constant")
have_func("rb_int_parse_cstr")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strscan requires Ruby 2.4 or later.
What is the minimum Ruby version to use rb_int_parse_cstr()?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rb_int_parse_cstr has been available since Ruby 2.5.0. In Ruby 2.4, it is detected using have_func, and if it is not available, it falls back to rb_str_to_inum.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

#ifdef HAVE_RB_INT_PARSE_CSTR
VALUE rb_int_parse_cstr(const char *str, ssize_t len, char **endp,
size_t *ndigits, int base, int flags);
#define RB_INT_PARSE_SIGN 0x01
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ruby/ruby#16322 is merged, this will report a duplicated definition warning.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1630df8 fixed it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we omit rb_int_parse_cstr() prototype and RB_INT_PARSE_SIGN definition entirely when Ruby provides them?

rb_define_method(StringScanner, "size", strscan_size, 0);
rb_define_method(StringScanner, "captures", strscan_captures, 0);
rb_define_method(StringScanner, "values_at", strscan_values_at, -1);
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);
rb_define_method(StringScanner, "integer_at", strscan_integer_at, 1);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1630df8 fixed it.

@jinroq jinroq requested a review from kou March 6, 2026 16:43
@eregon
Copy link
Copy Markdown
Member

eregon commented Mar 6, 2026

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

@kou Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?
In fact on TruffleRuby StringScanner uses MatchData objects internally.

I think it would be better to expose the MatchData object than keep defining methods similar to MatchData but with slightly different names. I think it makes it harder to learn the StringScanner API (i.e., it would be smaller and easier to approach if it didn't duplicate many MatchData methods).

Among all StringScanner instance methods:

  <<, [], beginning_of_line?, captures, charpos, check, check_until,
  concat, eos?, exist?, fixed_anchor?, get_byte, getch, initialize_copy,
  inspect, match?, matched, matched?, matched_size, named_captures,
  peek, peek_byte, pointer, pointer=, pos, pos=, post_match, pre_match,
  reset, rest, rest_size, scan, scan_byte, scan_integer, scan_until,
  size, skip, skip_until, string, string=, terminate, unscan, values_at

These are just doing the same on the MatchData:

  [], captures
  matched, matched?,
  matched_size (same as `byteend(0) - bytebegin(0)`, named_captures
  post_match, pre_match
  size, string, values_at

And these are MatchData methods which StringScanner doesn't have:

  begin, bytebegin, byteend, byteoffset, deconstruct,
  deconstruct_keys, end, length, match,
  match_length, names, offset,
  regexp, to_a

@eregon
Copy link
Copy Markdown
Member

eregon commented Mar 6, 2026

If https://bugs.ruby-lang.org/issues/21932 gets merged it seems cleaner to reuse that than to reimplement it.

Mmh, but that likely wouldn't achieve as good a speedup as the current approach in the context of https://bugs.ruby-lang.org/issues/21943 as it would mean an extra MatchData allocation.
The strscan extension seems to save the matched captures but not a MatchData object:

/* the regexp register; legal only when MATCHED_P(s) */
struct re_registers regs;

BTW the presense of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

def test_integer_at_large_number
huge = '9' * 100
s = create_string_scanner(huge)
s.scan(/(#{huge})/)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s.scan(/(#{huge})/)
s.scan(/(\d+)/)

end

def test_integer_at_leading_zeros
s = create_string_scanner("007")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

007 is not a good data for this because 007 is valid both for base=10 and base=8. Do we need this test?

Comment on lines +1070 to +1078
# "09" would be invalid in octal, but integer_at always uses base 10
s = create_string_scanner("09")
s.scan(/(\d+)/)
assert_equal(9, s.integer_at(1))

# "010" is 8 in octal (Integer("010")), but 10 in base 10
s = create_string_scanner("010")
s.scan(/(\d+)/)
assert_equal(10, s.integer_at(1))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both of them? Can they to catch any different problem?

have_func("onig_region_memsize(NULL)")
have_func("rb_reg_onig_match", "ruby/re.h")
have_func("rb_deprecate_constant")
have_func("rb_int_parse_cstr")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Can we use rb_cstr_parse_inum() with Ruby 2.4?

Comment on lines +1944 to +1958
long j = 0;
if (ptr[0] == '-' || ptr[0] == '+') j = 1;
if (j >= len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
(int)len, ptr);
}
for (; j < len; j++) {
if (ptr[j] < '0' || ptr[j] > '9') {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
(int)len, ptr);
}
}
return rb_str_to_inum(rb_str_new(ptr, len), 10, 0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this accept 1_234?
See also: #192 (comment)

Comment on lines +1894 to +1913
GET_SCANNER(self, p);
if (! MATCHED_P(p)) return Qnil;

switch (TYPE(idx)) {
case T_SYMBOL:
idx = rb_sym2str(idx);
/* fall through */
case T_STRING:
RSTRING_GETMEM(idx, name, i);
i = name_to_backref_number(&(p->regs), p->regex, name, name + i, rb_enc_get(idx));
break;
default:
i = NUM2LONG(idx);
}

if (i < 0)
i += p->regs.num_regs;
if (i < 0) return Qnil;
if (i >= p->regs.num_regs) return Qnil;
if (p->regs.beg[i] == -1) return Qnil;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You copied this from strscan_aref(), right? Can we share common code with strscan_aref() and strsacn_integer_at()?

end = adjust_register_position(p, p->regs.end[i]);
len = end - beg;

if (len <= 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use == 0 here?
len may be negative?

len = end - beg;

if (len <= 0) {
rb_raise(rb_eArgError, "empty capture for integer conversion");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rb_raise(rb_eArgError, "empty capture for integer conversion");
rb_raise(rb_eArgError, "specified capture is empty: %"PRIsVALUE, idx);


if (endp != ptr + len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other reason on failure?


if (endp != ptr + len) {
rb_raise(rb_eArgError,
"non-digit character in capture: %.*s",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the target string has a trailing space, it's difficult to find a problem. How about surround the target string something like the following?

Suggested change
"non-digit character in capture: %.*s",
"non-digit character in capture: <%.*s>",

@kou
Copy link
Copy Markdown
Member

kou commented Mar 7, 2026

Do you know why StringScanner has an interface very similar to MatchData but yet doesn't expose the MatchData object?

No. But if we create a MatchData, it causes performance overhead, right? (StringScanner doesn't use MatchData internally.) It'll reduce a merit of this optimization.

BTW the presense of StringScanner.must_C_version makes me wonder, was StringScanner once written in Ruby?

Yes. But it's before StringScanner was imported to Ruby itself.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

@eregon
Copy link
Copy Markdown
Member

eregon commented Mar 7, 2026

No. But if we create a MatchData, it causes performance overhead, right?

Yeah, and I guess that's the main reason StringScanner directly exposes MatchData-like methods.
StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

FYI: https://i.loveruby.net/ja/projects/strscan/doc/ChangeLog.html (Japanese)

Interesting, thank you for the link.

@kou
Copy link
Copy Markdown
Member

kou commented Mar 7, 2026

StringScanner could still have a new method to return a MatchData, so MatchData methods which are not mirrored in StringScanner could be used.

Yes. But it's out-of-scope of this.

jinroq added 2 commits March 18, 2026 22:04
Add a method that returns a captured substring as an Integer,
following String#to_i(base) semantics. Accepts an optional base
argument (default 10), Symbol/String for named capture groups,
and returns 0 for non-numeric or empty captures.

Extract resolve_capture_index helper to share index resolution
logic between StringScanner#[] and StringScanner#integer_at.
When base is 10 and the capture contains only digits (with optional
sign) that fit in long, parse directly and return via LONG2NUM.
This covers the Date._strptime use case without temporary String
creation. All other cases fall through to rb_str_to_inum.
Provide a pure Ruby implementation using self[index].to_i(base)
for JRuby and other non-CRuby platforms. The C extension version
takes precedence when available.
@jinroq
Copy link
Copy Markdown
Author

jinroq commented Mar 22, 2026

@kou
This comment finalizes the specification for MatchData#integer_at(n). StringScanner#integer_at(n) has also been updated to comply with this specification. Please review it.

@jinroq jinroq requested a review from kou March 22, 2026 15:42
* This covers the Date._strptime use case. */
if (base == 10) {
long j = 0;
int negative = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use bool instead of int for boolean?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

485f27c fixed it.

Comment on lines +15 to +24
unless method_defined?(:integer_at)
# Fallback implementation for platforms without C extension (e.g. JRuby).
# Equivalent to self[index].to_i(base).
def integer_at(index, base = 10)
str = self[index]
return nil if str.nil?
str.to_i(base)
end
end

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't split #scan_integer documentation and implementation.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ef6281d fixed it.

}
}
if (all_digits) {
if (digit_count <= (sizeof(long) >= 8 ? 18 : 9)) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It seems that 9223372036854775807 (max int64_t value) isn't optimized. Is it intentional?
  • It seems that 00000000000000000001 isn't optimized. Is it intentional?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was not my intention. 6052143 fixed it.


def test_integer_at_index_zero
s = create_string_scanner("42 abc")
s.scan(/(\d+)/)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need (...) here, right?

Suggested change
s.scan(/(\d+)/)
s.scan(/\d+/)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f64fdd8 fixed it.

assert_equal({"number" => "1"}, scan.named_captures)
end

def test_integer_at
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use test_integer_at_XXX like other methods?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6a55781 fixed it.

Comment on lines +1121 to +1125
def test_integer_at_named_capture_undefined
s = create_string_scanner("2024-06-15")
s.scan(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/)
assert_raise(IndexError) { s.integer_at(:unknown) }
assert_raise(IndexError) { s.integer_at("unknown") }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use unknown for both of test name and test value?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4d3583f fixed it.

Comment on lines +1144 to +1149
def test_integer_at_underscore
# follows String#to_i: underscores are accepted
s = create_string_scanner("1_0_0")
s.scan(/(\d+(?:_\d+)*)/)
assert_equal(100, s.integer_at(1))
end
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we optimize this case too?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e686711 fixed it.

assert_equal(999999999999999999, s.integer_at(1))

# 19 digits: exceeds long on 64-bit, becomes bignum
s = create_string_scanner("9999999999999999999")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should use border values for testing. If "9" * 18 is the largest optimized value, we should use "9" * 18" and "1" * 19" (the next value of "9" * 18").

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0f2ad2a fixed it.


/*
* call-seq:
* integer_at(index, base = 10) -> integer or nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use specifier not index like we did for []?

strscan/ext/strscan/strscan.c

Lines 1625 to 1695 in 4243751

/*
*
* :markup: markdown
* :include: strscan/link_refs.txt
*
* call-seq:
* [](specifier) -> substring or nil
*
* Returns a captured substring or `nil`;
* see [Captured Match Values][13].
*
* When there are captures:
*
* ```rb
* scanner = StringScanner.new('Fri Dec 12 1975 14:39')
* scanner.scan(/(?<wday>\w+) (?<month>\w+) (?<day>\d+) /)
* ```
*
* - `specifier` zero: returns the entire matched substring:
*
* ```rb
* scanner[0] # => "Fri Dec 12 "
* scanner.pre_match # => ""
* scanner.post_match # => "1975 14:39"
* ```
*
* - `specifier` positive integer. returns the `n`th capture, or `nil` if out of range:
*
* ```rb
* scanner[1] # => "Fri"
* scanner[2] # => "Dec"
* scanner[3] # => "12"
* scanner[4] # => nil
* ```
*
* - `specifier` negative integer. counts backward from the last subgroup:
*
* ```rb
* scanner[-1] # => "12"
* scanner[-4] # => "Fri Dec 12 "
* scanner[-5] # => nil
* ```
*
* - `specifier` symbol or string. returns the named subgroup, or `nil` if no such:
*
* ```rb
* scanner[:wday] # => "Fri"
* scanner['wday'] # => "Fri"
* scanner[:month] # => "Dec"
* scanner[:day] # => "12"
* scanner[:nope] # => nil
* ```
*
* When there are no captures, only `[0]` returns non-`nil`:
*
* ```rb
* scanner = StringScanner.new('foobarbaz')
* scanner.exist?(/bar/)
* scanner[0] # => "bar"
* scanner[1] # => nil
* ```
*
* For a failed match, even `[0]` returns `nil`:
*
* ```rb
* scanner.scan(/nope/) # => nil
* scanner[0] # => nil
* scanner[1] # => nil
* ```
*
*/

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

433fd87 fixed it.

VALUE idx, vbase;
int base = 10;

rb_scan_args(argc, argv, "11", &idx, &vbase);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use specifier not idx?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

433fd87 fixed it.

jinroq added 12 commits March 23, 2026 20:48
Skip leading zeros to compute effective digit count, allowing values like "00000000000000000001" to use the fast path. Add overflow-checked parsing for 19-digit values so LONG_MAX fits in the fast path while LONG_MAX+1 correctly falls through to rb_str_to_inum.
Remove nested capture group and check group 3 directly for nil.
Non-digit behavior is already covered by test_integer_at_non_digit and index 0 is covered by test_integer_at_index_zero.
Extend base-10 fast path to parse underscore-separated digits(e.g. "1_000_000") without temporary String allocation, following String#to_i underscore rules.
@jinroq
Copy link
Copy Markdown
Author

jinroq commented Mar 31, 2026

@kou

Could you also add a test when the current position isn't 0?

6b44dbd fixed it.

@jinroq jinroq requested a review from kou March 31, 2026 14:07
@kou kou requested a review from Copilot April 1, 2026 05:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1965 to +1973
bool overflow = false;
for (; j < len; j++) {
if (ptr[j] != '_') {
unsigned long d = ptr[j] - '0';
result = result * 10 + d;
if (result > limit) {
overflow = true;
break;
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In parse_decimal_fast, the overflow-check path uses unsigned long result and performs result = result * 10 + d before comparing to limit. On 32-bit platforms, unsigned long can overflow/wrap during the multiply/add for inputs > ULONG_MAX, which can make result > limit unreliable and return an incorrect (wrapped) value instead of falling back to rb_str_to_inum (e.g., parsing a 10-digit value like "9999999999"). Consider using a wider intermediate type (e.g., unsigned long long) or doing a pre-multiply overflow check (result > limit/10 etc.) before updating result.

Suggested change
bool overflow = false;
for (; j < len; j++) {
if (ptr[j] != '_') {
unsigned long d = ptr[j] - '0';
result = result * 10 + d;
if (result > limit) {
overflow = true;
break;
}
unsigned long limit_div10 = limit / 10;
unsigned long limit_mod10 = limit % 10;
bool overflow = false;
for (; j < len; j++) {
if (ptr[j] != '_') {
unsigned long d = (unsigned long)(ptr[j] - '0');
if (result > limit_div10 ||
(result == limit_div10 && d > limit_mod10)) {
overflow = true;
break;
}
result = result * 10 + d;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e6e5c27 fixed it

int base = 10;

rb_scan_args(argc, argv, "11", &specifier, &vbase);
if (!NIL_P(vbase)) base = NUM2INT(vbase);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integer_at currently treats an explicit nil second argument the same as omitting the argument (base remains 10 because of if (!NIL_P(vbase)) ...). This diverges from the documented equivalence to self[specifier].to_i(base), where passing nil for base raises a TypeError (and it also diverges from the pure-Ruby fallback, which would call to_i(nil) and raise). Consider changing the base handling to distinguish “argument not provided” vs “provided as nil” (e.g., check argc instead of NIL_P).

Suggested change
if (!NIL_P(vbase)) base = NUM2INT(vbase);
if (argc > 1) base = NUM2INT(vbase);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ba15508 fixed it.

assert_equal(0, s.integer_at(1)) # base 10: "0xF".to_i => 0
assert_equal(15, s.integer_at(1, 0)) # base 0: "0xF".to_i(0) => 15
end

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

integer_at accepts an optional base argument, but the new tests don’t currently cover error cases for that parameter (notably base = nil, which should behave the same as String#to_i(nil) / raise TypeError if the implementation follows the stated self[specifier].to_i(base) equivalence). Adding a regression test here would help ensure the C and Ruby fallback implementations stay aligned.

Suggested change
def test_integer_at_base_nil
s = create_string_scanner("2024")
s.scan(/(\d+)/)
assert_raise(TypeError) { s.integer_at(1, nil) }
end

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#193 (comment) fixed it.

Comment on lines +1868 to +1895
/*
* call-seq:
* integer_at(specifier, base = 10) -> integer or nil
*
* Returns the captured substring at the given +specifier+ as an Integer,
* following the behavior of <tt>String#to_i(base)</tt>.
*
* +specifier+ can be an Integer (positive, negative, or zero), a Symbol,
* or a String for named capture groups.
*
* Returns +nil+ if:
* - No match has been performed or the last match failed
* - The +specifier+ is an Integer and is out of range
* - The group at +specifier+ did not participate in the match
*
* Raises IndexError if +specifier+ is a Symbol or String that does not
* correspond to a named capture group, consistent with
* <tt>StringScanner#[]</tt>.
*
* This is semantically equivalent to <tt>self[specifier].to_i(base)</tt>
* but avoids the allocation of a temporary String when possible.
*
* scanner = StringScanner.new("2024-06-15")
* scanner.scan(/(\d{4})-(\d{2})-(\d{2})/)
* scanner.integer_at(1) # => 2024
* scanner.integer_at(1, 16) # => 8228
*
*/
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this just before static VALUE strscan_integer_at()?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9efb4d2 fixed it.


if (effective_digits <= (sizeof(long) >= 8 ? INT64_DECIMAL_SAFE_DIGITS : INT32_DECIMAL_SAFE_DIGITS)) {
long result = 0;
for (; j < len; j++) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use for (k = first_nonzero; k < len; k++) here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f51e1cf fixed it.

? (unsigned long)LONG_MAX + 1
: (unsigned long)LONG_MAX;
bool overflow = false;
for (; j < len; j++) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f51e1cf fixed it.

static inline VALUE
parse_decimal_fast(const char *ptr, long len)
{
long j = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use meaningful name instead of j? I think that this is not a loop variable.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

764a4a1 fixed it.


/* Validate: only digits and underscores (not leading/trailing/consecutive) */
{
long k;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use i not k?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0f67657 fixed it.

}
if (!overflow) {
if (negative) {
if (result == (unsigned long)LONG_MAX + 1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use limit here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7f19293 fixed it.

Comment on lines +30 to +32
str = self[specifier]
return nil if str.nil?
str.to_i(base)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use self[specifier]&.to_i(base)?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e4c6a1e fixed it.

if (negative) result = -result;
return LONG2NUM(result);
}
/* One more digit than safe: may still fit in long with overflow check */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of combining validation/counting/safe_digits_branch/one_more_digit_branch to a single for-loop like this?

for (k = j; k < len; k++) { //single moderate-fast path
  if (invalid) stop_parsing_or_fallback_to_slow_path;
  if (result > (limit - d) / 10) fallback_to_slow_path; // always check overflow
  result = result * 10 + d;
}

It always checks overflow, which is an overhead. It always performs multiplication even for > 19 digits case.
But code is simple, less branch, boundary test will be simple too.

Compared to this, the current code is really optimized to remove all the overhead that the simple for-loop has.

  • validation: to eliminate multiplication for invalid case
  • counting: to eliminate useless multiplication for large number of digits case
  • safe_digits_branch: fast path to eliminate overflow check

I just want to know if the performance improvement of these optimizations are really worth adding complexity to the code.
My guess is, a simple and moderate fast path is fast enough for the use case.

If the overflow check overhead needs to be reduced, something like result > (LONG_MAX-9)/10 && result > (limit-d)/10 might be performant.

@jinroq
Copy link
Copy Markdown
Author

jinroq commented Apr 5, 2026

@kou

I’d like to hear your thoughts on this comment.

Regarding Date._strptime, this comment makes a valid point.
Since the year is 4 digits, the month is 2 digits, and the day is 2 digits - totaling 18 digits or fewer - paths other than the safe path (such as the counting and overflow check paths) are rarely reached in actual use cases.
Compared to the two-pass approach currently used in the code (validation + counting -> accumulation), a simple single loop can be expected to achieve comparable performance in most cases.

However, StringScanner#integer_at is not a feature intended solely for Date.

  • Since a limit must be set for inputs with more than 19 digits (as in the test_integer_at_large_number case), the limit must be determined before the negative is finalized, which can ultimately result in code that is at least as complex
  • The current implementation correctly handles all cases involving signs, underscores, and leading zeros; if replaced with a simple loop, these operations would need to be crammed into a single loop.

I don’t think I need to respond.

@kou
Copy link
Copy Markdown
Member

kou commented Apr 5, 2026

Could you add your comment to the thread #193 (comment) for easy to follow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants